MADRID CENTRAL

Social Data Analysis & Visualization - DTU, Spring 2022

- Laurine Dargaud, Alba Reinders Sánchez and Alejandro Valverde -

TABLE OF CONTENTS


1. Motivation

The purpose of our project is to analyze the real impact of the measure that took place in the city center of Madrid called Madrid Central.

To sum up, Madrid Central was an environmental measure that came into force on the 30th of November 2018. It prohibited all polluting vehicles from entering the "Centro" district of Madrid. Unfortunately, it was repealed for political and economic reasons on the 1st of July 2019.

In this study, we aim to evaluate the effectiveness of this vehicle regulation on pollution by analyzing traffic and air quality.

Figure: Map of Madrid Central low-emission zone delimitation

1. a) Our datasets

We use multiple datasets, all of them from the Open Data platform of the Town hall of Madrid.

Go to Open Data of Madrid website

1. a) i - What is our data?

The datasets we use are:

1. a) ii - Why these datasets?

We chose these two datasets because we wanted to find out whether Madrid Central achieved its objective of reducing air pollution in the area, and to measure how large the change in traffic was inside and outside Madrid Central.

Additionally, we wanted to analyze the relation between air quality and traffic in the city of Madrid.

1. b) End user's experience

While investigating this topic, we saw a lot of articles about Madrid Central. Some said how good and beneficial it was; others said the exact opposite. Sadly, most of the news and articles were strongly influenced by political ideas, because the measure was imposed by one political party against the wishes of the opposing party. We therefore feel they were biased and did not rely on data and science.

We aim to fix this by giving a data-based answer to how effective the measure really was, using visualizations and data analysis. We also want to present this in an informative and interactive way.


2. Basic Stats

Libraries import and constants definition

2. a) Data preprocessing

The first thing to do in any data-related task is to preprocess and clean the data. In our case this is very important, as our starting point is many different files containing a lot of information. We have to reduce the dataset to only the relevant parts.

2. a) i - Air Quality Preprocessing

Let's start by downloading all the air quality data directly from the Open Data Portal database of Madrid.

Once we have downloaded the data, we need to preprocess it to obtain a suitable format for the upcoming visualisations.

Indeed, each monthly CSV file contains one column for each hour with an associated validation column (see the Air Quality Data documentation for more details - in Spanish). Eventually we want a dataset with the following columns only:

For this purpose, we need to apply the following steps on each monthly file:

  1. Drop the validation columns (we consider all values as validated to facilitate the process)
  2. Get only one HOUR column by melting the dataset
  3. Generate the datetime column
  4. Keep only the air quality station ID of each measure

Then we can concatenate all monthly dataframes into one final dataset: air_quality_data.csv.
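The four steps above can be sketched with pandas as follows. This is only a minimal illustration, not necessarily the code used in the notebook: the column names (`ESTACION`, `MAGNITUD`, `ANO`, `MES`, `DIA`, `H01`..`H24`, `V01`..`V24`) and the convention that `H01` means 01:00 are assumptions that should be checked against the Air Quality Data documentation.

```python
import pandas as pd

# Assumed column names (not given in the text): station, pollutant and date
# columns, plus hourly values H01..H24 with validation flags V01..V24.
ID_COLS = ["ESTACION", "MAGNITUD", "ANO", "MES", "DIA"]
HOUR_COLS = [f"H{h:02d}" for h in range(1, 25)]
VALID_COLS = [f"V{h:02d}" for h in range(1, 25)]


def reshape_monthly(df: pd.DataFrame) -> pd.DataFrame:
    """Turn one wide monthly table into a long table with a datetime column."""
    # 1. Drop the validation columns.
    df = df.drop(columns=VALID_COLS, errors="ignore")
    # 2. Melt the 24 hourly columns into a single HOUR column.
    tidy = df.melt(id_vars=ID_COLS, value_vars=HOUR_COLS,
                   var_name="HOUR", value_name="value")
    # 3. Build the datetime. H01 is assumed to mean 01:00, so H24 rolls over
    #    to midnight of the following day.
    day = pd.to_datetime(
        tidy[["ANO", "MES", "DIA"]].rename(
            columns={"ANO": "year", "MES": "month", "DIA": "day"}
        )
    )
    hours = tidy["HOUR"].str[1:].astype(int)
    tidy["datetime"] = day + pd.to_timedelta(hours, unit="h")
    # 4. Keep only the station ID, the pollutant code and the measurement.
    return tidy[["datetime", "ESTACION", "MAGNITUD", "value"]]
```

The monthly results could then be combined with something like `pd.concat(map(reshape_monthly, monthly_frames), ignore_index=True)`, where `monthly_frames` is a hypothetical list of the loaded monthly tables.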

WARNING: the following air quality data processing takes time (about 25 minutes)!

If you just want to get the final air_quality_data.csv file, set GET_FINAL_AIR_QUALITY_DATA = True only

We can then join the air quality data with the air quality stations and air information tables:

We restrict the time period to end at the start of the COVID lockdown to avoid bias, and we remove meaningless concentration values. We obtain our final dataset, ready for analysis!
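As a rough sketch of this filtering step: the cutoff date and the definition of "meaningless" (missing or negative concentrations) are assumptions made here for illustration, as are the column names `datetime` and `value`.

```python
import pandas as pd


def restrict_and_clean(df: pd.DataFrame, cutoff: str = "2020-03-01") -> pd.DataFrame:
    """Keep rows before `cutoff` whose concentration is present and non-negative.

    `cutoff` is an assumed parameter; adjust it to the lockdown date you
    want to use.
    """
    before_cutoff = df["datetime"] < pd.Timestamp(cutoff)
    meaningful = df["value"].notna() & (df["value"] >= 0)
    return df.loc[before_cutoff & meaningful].reset_index(drop=True)
```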

2. a) ii - Traffic Preprocessing

The data we want to work with is very large, thus we need to download it from the source as it is not possible to upload it to the version control system we use (GitHub).

Before diving into the actual data, we need some context. Madrid is divided into 21 districts, and the area of Madrid Central is exactly the same as the Centro district (hence the name).

We have a dataset with the locations of the traffic measurement points. As expected, they are not evenly distributed. Our first task is to find out which district each traffic measurement point is located in.

First we need to calculate the correct UTM coordinates for displaying in Bokeh maps.

Then we load the districts information to display them in the map.

Now we can record which district each traffic point belongs to.
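One way to assign a point to a district is a point-in-polygon test. The sketch below uses a plain ray-casting test so that it has no dependencies; the notebook may well have used a geospatial library instead. The `districts` mapping (district name to a list of boundary vertices in the same coordinate system as the points) is a hypothetical structure introduced for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray to the right crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def district_of(x, y, districts):
    """Return the name of the first district containing (x, y), or None."""
    for name, polygon in districts.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None
```

Points lying exactly on a boundary are not handled specially here, which is usually acceptable for sensor locations.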

The next step is to load the traffic datasets. These datasets have a lot of rows, as each of the more than 3000 measurement points records multiple parameters every 15 minutes, so a rough approximation of how many rows each monthly file has is:

$$ 30(days) \cdot 24(hours) \cdot 4(measures\_per\_hour) \cdot 3000(traffic\_points) = 8640000 $$
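The estimate above can be reproduced with a one-line helper. The function name and its defaults are only illustrative; the defaults mirror the figures quoted in the text (about 3000 points, one record every 15 minutes).

```python
def estimated_rows(days, traffic_points=3000, measures_per_hour=4):
    """Rough number of rows for `days` days of 15-minute measurements."""
    return days * 24 * measures_per_hour * traffic_points


monthly_rows = estimated_rows(30)  # one month of roughly 30 days
```

The same helper can be called with the number of days in any longer period to estimate the size of the full dataset.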

Moreover, since we are using data from 2016 until the end of 2021 (6 years), the total row count is approximately:

$$ 6(years) \cdot 365(days) \cdot 24(hours) \cdot 4(measures\_per\_hour) \cdot 3000(traffic\_points) = 630720000 $$

This amount of data (more than 630 million rows) is too much to handle efficiently and to extract relevant information from. To reduce the number of rows, we decided to keep the average traffic intensity (number of cars) per day in each district. That way, we will have:

$$ 6(years) \cdot 365(days) \cdot 21(number\_districts) = 45990 $$

which is a more manageable number, from which we hope to detect the relevant information in the data. This is around 13714 times less data.
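The reduction to one value per day and district can be sketched as a pandas group-by. The column names `datetime`, `district` and `intensity` are assumptions introduced for this example.

```python
import pandas as pd


def daily_district_average(records: pd.DataFrame) -> pd.DataFrame:
    """Average the 15-minute intensities per district and calendar day."""
    # normalize() sets the time of day to midnight, leaving only the date.
    with_date = records.assign(date=records["datetime"].dt.normalize())
    return with_date.groupby(["district", "date"], as_index=False)["intensity"].mean()
```

Because the raw files are large, this aggregation would typically be applied to each monthly file separately before concatenating the results.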

2. b) Exploratory data analysis

2. b) i - Air Quality Data Exploration

Map of Air Quality Stations

First let's get a better idea of air quality stations location, and the different gases they monitor. Let's plot a map to do so!

We notice that only one air quality station is located in the area of Madrid Central: Plaza del Carmen, which monitors SO2, CO, NO, NO2, NOx and O3.

Boxplots of gas concentrations

We can plot boxplots of gas concentrations to know how data is distributed.

We note many outliers, including obviously extreme ones for SO2 and NO. These are most likely recording errors from the air quality sensors, so we remove them from the data.
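The text does not say which rule was used to remove these values. One common option, shown here purely as an assumption, is an upper fence based on the interquartile range computed separately for each gas. The column names `gas` and `value` and the factor `k` are illustrative.

```python
import pandas as pd


def drop_extreme_values(df: pd.DataFrame, k: float = 3.0) -> pd.DataFrame:
    """Drop rows above Q3 + k * IQR, with quartiles computed per gas."""
    grouped = df.groupby("gas")["value"]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    upper_fence = q3 + k * (q3 - q1)
    return df[df["value"] <= upper_fence]
```

A large `k` keeps ordinary pollution peaks and removes only values that are far outside the usual range.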

We can plot the boxplots again.

The most widespread pollutants are O3, NOx, NO2 and PM10.

Bar chart of average gas concentration per air quality station

Let's understand the average concentration of pollutant for each station.

Multiline chart of gas concentrations over time for given stations

Multiline chart of the air quality progress relative to the same month of the previous year

To get a better idea of the air quality progress in a given location over time, let's plot the evolution relative to the same month of the previous year for the recorded pollutants.

2. b) ii - Traffic Data Exploration

Map of traffic points

We want to check first how many traffic points are in each district.

From the table we can see that the number of traffic points differs considerably between districts: Barajas has the fewest (41) and Chamartín the most (357).

This is something we have to take into account, as too few data points may lead to poor conclusions. Luckily, the Centro district has 176 measurement points, which we believe is enough given the small area (5.23 km²) of this district.

Now let's compare the traffic intensity in each district, to see which districts are busier.

From this plot, we can see that Centro is not the busiest district. We do not know yet whether this is thanks to the measures or whether it is simply not a heavily trafficked district. What we do know is that the surrounding districts, such as Arganzuela, Retiro or Moncloa, are some of the busiest in the city, which may itself be partly a consequence of the measures. It will therefore be interesting to see whether there is a border effect because of Madrid Central.

If we focus on the traffic points as separate measurements instead of on districts as a whole, we observe an expected result: the traffic intensity on the main roads of Madrid is higher than on the secondary roads. Inside Madrid Central, the busiest road is Gran Vía, the main commercial road in the district. We also see a lot of traffic surrounding Madrid Central, so it will be interesting to investigate whether this is normal traffic flow or a consequence of the measures.

Averages alone are not enough for a time series analysis, so we now look at the daily traffic intensity per district from 2016 to February 2020, in order to see how traffic evolves over time.

With 21 districts, the plot becomes a bit overwhelming. But if we show only the districts that we want to see or compare, we can follow the evolution of the traffic intensity over the days. We see that Centro is not the district with the most traffic, and that almost all districts follow the same trends. On a day with less traffic, the decrease appears in all districts in proportion to their own intensity.

We suspect that the days when traffic increases or decreases in all districts are public holidays or other celebrations. Let's find out whether those days are related to any specific dates.

As we suspected, almost all the peaks and valleys coincide with public holidays in Madrid, which gives us an overview of the traffic behavior. We also observe that the area of Madrid Central (Centro district) closely follows the average of the whole city of Madrid.

The gray dashed lines represent the 3 key dates (start of Madrid Central, start of fines, end of Madrid Central).

Once we have seen the whole picture of the traffic intensity evolution during the period we are investigating, we want to focus on a yearly, monthly and weekly analysis between Madrid Central area and the city of Madrid.

From this first plot we see that in 2016 and 2017 the average of traffic intensity per day was higher in Madrid Central than in the rest of the city, but in 2018 it changed. Since 2018, we observe how the average from Madrid Central decreases and becomes lower than the average from the city.

Regarding the average traffic intensity per day in each month, it is more or less the same for Madrid Central and the city. It is still worth noting that for some months the average for Madrid Central is a bit higher, for example during the summer holidays (June, July, August, September) or the Easter holidays (March-April depending on the year), which could be related to more tourists going to the city center (Madrid Central area).

Lastly, the weekly plot shows that Madrid Central has a higher average traffic intensity per day during weekends than the rest of the city, and that throughout the week the average increases progressively in both cases.
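The yearly, monthly and weekly averages discussed above can be obtained by grouping the daily values on different parts of the date. A minimal sketch, assuming a table with hypothetical columns `date` and `intensity`:

```python
import pandas as pd


def calendar_averages(daily: pd.DataFrame):
    """Return mean daily intensity by year, by month and by day of the week."""
    dates = daily["date"]
    by_year = daily.groupby(dates.dt.year)["intensity"].mean()
    by_month = daily.groupby(dates.dt.month)["intensity"].mean()
    # dayofweek: Monday is 0 and Sunday is 6.
    by_weekday = daily.groupby(dates.dt.dayofweek)["intensity"].mean()
    return by_year, by_month, by_weekday
```

Applying the function once to the Centro rows and once to the rest of the city gives the two series compared in each plot.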

3. Data analysis

Once we have done our exploratory analysis, it is important to obtain conclusions doing a more in depth analysis of the relevant information obtained.

3. a) Air Quality Data Analysis

It would be interesting to see what is happening around Madrid Central and far from Madrid Central.

First, let's choose some stations so that we can consider the average of them.

Air quality stations monitor various gas concentrations each hour in Madrid that we can compare in the following interactive line chart.

A positive percentage refers to a gas concentration increase in comparison with the same month of the previous year. In contrast, a negative percentage refers to a gas concentration drop in comparison with the same month of the previous year, which is the aim of Madrid Central.
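The percentage described above can be sketched as follows, assuming the concentrations of one pollutant are held in a pandas Series indexed by timestamp. The function name is illustrative and the notebook may compute it differently.

```python
import pandas as pd


def year_over_year_change(series: pd.Series) -> pd.Series:
    """Percentage change of each monthly mean versus the same month one year earlier."""
    # "MS" labels every month by its first day; months without data become NaN.
    monthly = series.resample("MS").mean()
    # shift(12) aligns each month with the same month of the previous year.
    return (monthly / monthly.shift(12) - 1.0) * 100.0
```

The first twelve months have no reference year and therefore come out as missing values.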

Inside the low-emission zone of Madrid Central (first tab), the vehicle regulation leads to a decrease or a slower increase of Nitrogen Oxides (NOx = NO2 + NO), Ozone (O3) and Sulfur Dioxide (SO2). These gases are three common pollutants linked to car traffic that are highly involved in air pollution. Nonetheless, air pollution outside Madrid Central, both near and far from the low-emission zone borders, starts to decrease only once fines are enforced.

To get a better visualisation over time and space, let's have an animated map.

It can be seen that Madrid experienced periods of high pollution before the implementation of Madrid Central, e.g. from 2016 to 2018. Then, after a short transition time of some months, NO2-based air pollution remains below European standards thanks to Madrid Central. Eventually, air pollution becomes critical again as soon as Madrid Central ends.

Let's generate a final multiple bar-chart to inspect the differences in time (before MC, after inauguration, after fines application, and after the end of MC) and space (in MC area, around and far from MC area).
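To build such a chart, each measurement first needs a period label. A small sketch using only the standard library is given below. The start and end dates come from the text; the date on which fines began is not stated in this report, so the value used here is an assumption to be verified.

```python
from bisect import bisect_right
from datetime import date

MC_START = date(2018, 11, 30)    # start of Madrid Central (from the text)
FINES_START = date(2019, 3, 16)  # assumption: not stated in this report
MC_END = date(2019, 7, 1)        # end of Madrid Central (from the text)

BOUNDARIES = [MC_START, FINES_START, MC_END]
LABELS = ["before MC", "after inauguration", "after fines", "after end of MC"]


def period_of(day: date) -> str:
    """Return the label of the period that `day` falls into."""
    # bisect_right counts how many boundaries are on or before `day`.
    return LABELS[bisect_right(BOUNDARIES, day)]
```

Averaging the concentrations per period label and per zone (inside, around, far from Madrid Central) then gives the bar heights.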

First, this multi-bar chart allows us to reproduce the results of a previous study conducted on the same subject.

Indeed, Enrique Galdon-Sanchez et al. state that "the concentration of nitrogen dioxide (NO2), a harmful pollutant, decreased by 18.6% in the Madrid Central area". We found the same percentage drop, with the NO2 concentration going from 46.36 µg/m3 to 37.71 µg/m3. Since the European standard for NO2 concentration is 40 µg/m3, the operation of Madrid Central actually helped the city stop exceeding the limit.
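As a quick check of the figures quoted above, the relative drop can be recomputed from the two rounded concentrations. Because the inputs are rounded, the result only approximately matches the published percentage.

```python
EU_NO2_LIMIT = 40.0  # µg/m3, the European standard quoted in the text


def percent_drop(before: float, after: float) -> float:
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


drop = percent_drop(46.36, 37.71)           # between 18.6 and 18.7 percent
was_above_limit = 46.36 > EU_NO2_LIMIT      # before Madrid Central
is_below_limit = 37.71 < EU_NO2_LIMIT       # during Madrid Central
```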

According to the same research paper, there is almost no border effect, since the NO2 air pollution outside Madrid Central had only a 3, increase during the vehicle regulation. We reach the same conclusion.

In addition, we go into a bit more detail, since we split the Madrid Central period into 2 parts: before and after fines enforcement. We clearly see that a low-emission zone becomes really efficient after fines enforcement, i.e. when citizens risk paying if they drive in the central district!

We also notice that the pollution decreased after the application of fines not only in the central district but in ALL Madrid districts. Given this observation, we can wonder whether this global decrease in air pollution in the city is related to weather conditions rather than to the Madrid Central regulation. In fact, rain washes pollutants out of the air. However, according to an advanced scientific study on weather, the period of Madrid Central was not particularly rainy. Thus, the air pollution decrease is unlikely to be weather-related, so it seems that Madrid Central actually had a positive effect on the whole city!

Besides, we show that ending Madrid Central by lifting the restrictions and authorizing vehicles back in the city center leads to a resurgence of NO2 pollution and a concentration that exceeds the European limit again...


3. b) Traffic Data Analysis

In our exploratory analysis we visualized a static map. To see whether the regulations made a difference, we have to compare the situation before and after they were introduced.

OK, this map is not as informative as we thought it would be. We can see small changes in different districts, but nothing very significant. That is why we will try a different visualization, in an attempt to display the changes more clearly.

It seems that the measure was efficient. We can see some changes, especially in the area of Gran Vía (the long road that crosses the north of the district from east to west). To display these changes better, we plot the difference between before and during the measure. Just from this data, it looks like the measure was effective and reduced the number of vehicles in the city center.
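The before/during difference per measurement point can be sketched with a pivot table. The column names `point_id`, `period` and `intensity`, and the period values `"before"` and `"during"`, are assumptions made for this example.

```python
import pandas as pd


def intensity_change(df: pd.DataFrame) -> pd.DataFrame:
    """Mean intensity per point for each period, plus during-minus-before."""
    means = df.pivot_table(index="point_id", columns="period",
                           values="intensity", aggfunc="mean")
    # Negative values mean less traffic while the measure was in place.
    means["difference"] = means["during"] - means["before"]
    return means
```

Plotting the `difference` column on the map with a diverging color scale makes increases and decreases easy to tell apart.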

But we do not know yet if this is a real difference, or if it is just that in this period of time there was less traffic in all of Madrid. We need to do a further analysis to observe that.

To compare Madrid Central with the whole city of Madrid, we will do a week analysis of the traffic, to try to find if the trends before and during are the same as the ones we saw in

This seems promising: it looks like Madrid Central was indeed effective at reducing traffic inside the city center. We may still be subject to some biases, though. For example, as we are working with averages, we cannot be sure whether this decrease simply reflects the measure being in force during a season of low vehicle activity.

To check this, we have to display all the data we have and see whether the improvement is due to Madrid Central being effective or happened for some other reason.

Now, analyzing the traffic behavior during the different months of the year inside the Madrid Central area, we see that in summer (June, July, August) the traffic intensity decreases a lot, especially in August. This is something that we already expected, as well as a small decrease at Christmastime (December, January). During holiday periods, people tend to leave the city center.

We can also appreciate that the years 2018 and 2019 have a lower average of traffic intensity per month, compared with 2016 and 2017.

If we calculate the difference in average traffic between the Madrid Central area and the other districts, we find that in 2018 and 2019 this difference is smaller. Once again, as expected, traffic in summer is more intense in the city center relative to the other districts. We can also see a huge difference at the beginning of 2019, when the measure was in place. This may indicate, once again, its effectiveness.

3. c) Relation between Air Quality and Traffic

Now that we have analyzed air quality and traffic data on their own, it is important to find the relation (if there is one) between them. The most obvious first step is to compute the correlation matrix.
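A correlation matrix of this kind can be computed directly with pandas. The sketch below assumes a table with one row per day and one column per variable; the column names passed in are hypothetical.

```python
import pandas as pd


def correlation_matrix(daily: pd.DataFrame, columns) -> pd.DataFrame:
    """Pearson correlation between the selected daily variables."""
    return daily[list(columns)].corr(method="pearson")
```

For example, `correlation_matrix(daily, ["intensity", "NO2", "O3"])` returns a symmetric table with ones on the diagonal. Values close to 0 indicate a weak linear relation, which says nothing about possible non-linear or delayed effects.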

CONCLUSIONS

In the light of this Madrid Central study case, we investigated whether the implementation of vehicle regulations in dense urban areas contributes to reducing pollution.

To recall our primary question: WAS MADRID CENTRAL EFFICIENT AS EXPECTED? The answer is YES!

Regarding both air quality and traffic data, we show that Madrid Central was efficient. The NO2 concentration decreased by 18.6%, and it decreased especially after the enforcement of fines for drivers. This way, Madrid ended up below the European standard for NO2 as required. Traffic also decreased during the implementation period of Madrid Central, and it even fell below the traffic observed outside the low-emission area. Nevertheless, based on the correlation matrices we computed, air quality and traffic do not seem strongly correlated. Even though they are positively correlated, the correlation coefficients are lower than one would intuitively expect. Part of the explanation may lie in the fact that air pollution depends on many more factors than traffic: it is related to urban life, surrounding buildings, weather or people's behavior, for instance.

Finally, as you know, the Madrid Central regulation has been lifted due to a political reversal... We show in our study that allowing cars back into the city center leads to a return to the original pollution situation. The current Council of Madrid is in the process of applying its new environmental plan, called MADRID 360.

MADRID 360 is a much more ambitious measure, but with a long-term staging, and far less restrictive than Madrid Central. It allows more vehicles inside the city center, has more parking spots, etc. It will increase the area of considered limitations zone year by year, in an attempt to reduce the traffic further, and hopefully it will.

As a summary, MADRID 360 has fewer restrictions on vehicles, but covers a much greater area.

Stay tuned, and let's hope this new plan will be effective enough, for the sake of the health of our dear Earth!


4. Genre

4. a) Our data story

Our data story is presented as a Partitioned Poster, by including as many user interactions as possible.

Referring to Segel and Heer concepts, we use a mixture of author-driven and reader-driven stories.

Indeed, we provide a "linear ordering of scenes" and we have "heavy messaging" over the webpage to explain our findings, which relates to author-driven stories. Nonetheless, we offer the possibility of free navigation to the user, and each visualisation allows "free interactivity" to the user, which makes our story more reader-driven.

Table: Properties of Author-Driven and Reader-Driven Stories. Most visualizations lie along a spectrum between these two extremes.
(source: Segel and Heer's article)

4. b) Visual Narrative

Visual Structuring

[Definition] Visual structuring refers to "mechanisms that communicate the overall structure of the narrative to the viewer and allow him to identify his position within the larger organization of the visualization" (Segel and Heer).

[Our application] We use Consistent Visual Platform, as we are telling our story through a webpage that induces an implicit hierarchy of the presented sections by following the reading direction.

Figure: Strong visual hierarchy explanation (source)

Highlighting

[Definition] Highlighting refers to "visual mechanisms that help direct the viewer’s attention to particular elements in the display" (Segel and Heer).

[Our application] We use Feature Distinction as we divide our story into different sections that are clearly separated (different background colors, distinguishable title...).

Figure: How we can distinguish our sections in our webpage thanks to background color and titles

Transition Guidance

[Definition] Transition guidance concerns "techniques for moving within or between visual scenes without disorienting the viewer" (Segel and Heer).

[Our application] We use Object Continuity, as we follow the same template design for all of our sections: colors follow the same global theme and we use the same range of fonts. This way, the user can understand that he/she is jumping to another section, without getting lost.

4. c) Narrative Structure

Ordering

[Definition] Ordering refers to "the ways of arranging the path viewers take through the visualization" (Segel and Heer).

[Our application] We use User Directed Path, since "the user must select a path among multiple alternatives" to navigate in our poster. Despite the implicit reading direction, the user is free to jump around and consult the section he/she wants, thanks to the clickable table of contents. Additionally, as we have multiple tabs for some of our visualizations, the user is not required to follow any specific order among those tabs.

Figure: Tabs in our visualisations make our navigation user-directed

Interactivity

[Definition] Interactivity refers to "the different ways a user can manipulate the visualization" (Segel and Heer).

[Our application] We use both Hover Highlighting/Details and Filtering/Selection/Search, as we allow hovering in our plots to display additional information that the user may be interested in. We also allow in most of the plots a filtering and selection method to only show what may be relevant on a given moment via clickable legends items. Furthermore, one of the interactive maps also uses Navigation Buttons to navigate over months in the plot.

Figure: How users can interact with our visualisations

Messaging

[Definition] Messaging refers to "the ways a visualization communicates observations and commentary to the viewer" (Segel and Heer).

[Our application] We use Captions/Headlines in all our plots, to make clear to the user what she/he is looking at. For each section, we use Introductory Text to provide context to the user, and Summary/Synthesis for each of the plots to give our interpretation of the results.

Figure: How users can interact with our visualisations

5. Visualizations

For our story, we decide on separating the different parts using sections.

For the Introduction, along with the title, we present the problem, the city of Madrid, how it is divided into districts and a summary of the complete story of Madrid Central. We use a plot so the user gets familiar with the shape of Madrid and can start to explore where each district is, with an interactive hover over the map. The story of Madrid is initially hidden, in case a user is not interested in reading it, and can be expanded by clicking on each image.

Figure: Webpage introduction

Right after, we show our findings in a section called Focus on Data, which is divided into Air Quality, Traffic and Relation between the air quality and traffic. In this section we display all our relevant plots and findings.

Inside Air Quality, we start by showing where the air quality measurement stations are located, so the user can get acquainted with the topic and with what is going to be discussed in the section. Below is the Comparison of gases with the previous year. With different tabs, this allows the user to see how the gases evolved compared with the same date of the year before, and to display the changes in Madrid Central, Outside Madrid Central, Close to Madrid Central and Far from Madrid Central.

Figure: Air quality stations location

Figure: Comparison of gases with the previous year

After that, we relate to the European air quality standards with a short gif, where the user can stop and move between frames if desired, displaying the evolution of $NO_2$ with sizes and colors, so it is more user-friendly and approachable.

Figure: European air quality standards in Madrid

Using the same categories as the ones displayed in the map, we plot average concentrations over the Madrid Central process in different locations. These bar plots make it possible to investigate the evolution and change more clearly, allowing us to obtain results and conclude our analysis of air quality.

Figure: Average concentrations over the Madrid Central process in different locations

Now we get to the Traffic analysis, where we give the user an overview, showing a map with the location of all the measurement stations.

Figure: Traffic measurement stations

After that, we start focusing on the area of Madrid Central, plotting the traffic intensity at each point before and during the measures, along with the difference, to show the changes that these measures brought about.

Figure: Average traffic intensity inside Madrid Central

That visualization informs us of the evolution of Madrid Central but, to really assess its efficacy, we need to compare it with the total traffic in the city of Madrid. For that, we use the following visualization, where we compare the weekly evolution before and during the measures, using data from both inside and outside Madrid Central. We use the same colors for each day of the week to make the plot easier for the user to understand.

Figure: Weekly analysis

Analyzing using averages is not enough to detect if the difference is due to outliers, or if it is because the measures really work. To show that to the user, we plot a comparison of the different years we are plotting, comparing against the average outside Madrid Central, therefore showing a more accurate evolution of the timeline.

Figure: Monthly analysis

To finish with the traffic, we display the full timeline with the key dates marked, using those dates as an aid to understand the changes and to show whether the measures really worked or not.

Figure: Traffic timeline

The last section where we use plots is the Relation between air quality and traffic. In this section we relate the air quality to the traffic. We do not find any direct relationship between the two, and we show that to the user with correlation matrices. We use several of them to show that there is no clear relation in Madrid Central before and during the measure, or in the city of Madrid as a whole.

Figure: Correlation matrices

6. Discussion

What went well?

We found useful data to investigate our idea and carry out our project. We could then explore, analyze and visualize the different datasets, dealing with a lot of information. Not only was the amount of data huge for some datasets, but the types of data were also different: we had to change coordinate systems, transform almost 5 years of measurements recorded every 15 minutes into a more manageable dataset that could be used to obtain visualizations and results, and work with distinct measurements of gas concentrations.

Then we explored the data, finding some insights that gave us an idea of which analyses we wanted to do, along with the corresponding visualizations, to show on the webpage in an informative way.

Using the webpage format, we have been able to tell our story in a more interactive way and show it in a concise and direct way. Thanks to our visualizations, we arrived at relevant conclusions that we could not have reached without an exhaustive analysis.

What is still missing? What could be improved? Why?

We could have incorporated more social related data, like socioeconomic data, but we believe that they are not relevant enough for our study, since we focus on discovering whether or not the environmental measure of Madrid Central reduced traffic intensity and pollution in the city center. So it is directly related to the well-being of people. Perhaps we could have used data on respiratory diseases, to see if the measure improved this condition.

We thought of using data we found about the public bicycle system established in the city, to see if there was a noticeable increase in the use of this public transport, but finally we could not include an in depth analysis of it due to time constraints.

Related to the socioeconomic data, we read a study where they used sales data of shops in Madrid Central to see how they were affected by the measure. But we did not want to prove anything like that, as it felt out of the scope of what we wanted to show.

Regarding the analysis of the data, we could have generated more charts or maps to help us better understand the data, or used different datasets such as weather conditions to support our conclusions. Still, we believe that the conclusions we obtained, together with the reports and news we found, are enough to draw conclusions about the effectiveness of Madrid Central.

We decided not to use Machine Learning in our project, as we believe it would not give any insights or new ideas that cannot be shown with the data analysis we carried out. We therefore felt it would be unnecessary to insert a machine learning model when it does not add anything to our narrative.

If we had really wanted to use a machine learning method, we would have needed a more complex model for time series analysis and prediction. We felt that this was outside the course content, and we wanted to spend our time on tasks more closely related to what we learnt in the course.

Contribution table

*Disclaimer: All members have collaborated to a greater or lesser extent. The table shows who has been in charge of a specific part, or who has contributed more to it.*


Bibliography

THE END